<html>
<head>
<title>Mike Shal - CS798 Proposal</title>
</head>
<body>
<ol>
<li><h2>Introduction</h2></li>
<p>This proposal describes the project I would like to undertake towards credit for a Masters of Science in Computer Science. The main goal of the project will be to design and implement a build system (like the UNIX <em>make</em> program, for example) that is able to scale to large sized projects. The sample projects and analysis will assume the projects are written in C and compiled with gcc, though it is hoped that the result will not be language-dependent. In the following sections I will give some background information about the history and current state of build systems, and why I believe they are insufficient for large-scale development. Then I will describe the separate components of the proposed build system. Finally, I will discuss the long-term goals of the program if it is successful.</p>

<li><h2>Background</h2></li>
<p>The need for a build tool becomes evident during the development of any project that grows to span multiple source files, libraries, and directories. The developer typically focuses on a small subset of the project, and even with just a few source files it is more efficient to re-compile only the changes than it is to re-compile the whole project. Parts of the project are dependent on other parts, however. For example, multiple source files can include the same header. A change to the header requires that each source file that includes it is re-built. Similarly, if an archive is re-built, then all programs that link to the archive must be re-linked. All of these dependencies must be tracked by the build program to ensure a consistent and correct build.</p>
<p>The common UNIX program <em>make</em> is a build tool that allows a developer to specify dependencies, and commands to rebuild targets. When executed, <em>make</em> will perform two main functions: 1) construct a DAG (directed acyclic graph) of all dependencies, and 2) traverse the graph starting at the requested target, rebuilding only those targets that are out of date with respect to their dependencies. In this way <em>make</em> can be used to build an entire project from scratch, or build only the portions of the project that have been affected by changes.</p>
<p>Unfortunately, <em>make</em> does not scale well for large projects when building only the parts of the project that are affected by changes. Specifically, the update operation is at best an O(n) algorithm, where <em>n</em> is the number of dependencies. This is undesirable, because unrelated pieces of the project affect the build time of the part of the project we may be working on. Consider just the effect of reading an increasing number of dependency files. A Makefile that includes all of these dependency files and simply checks if the files are up-to-date (without building anything) results in the build times shown in the following table. Notice how it scales in a mostly linear fashion, until cache effects start to dominate the large number of files.</p>
<table border=1>
<tr><td bgcolor="#BBFFFF"><b># Files</b></td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>64</td>
<td>128</td>
<td>256</td>
<td>512</td>
<td>1024</td>
<td>2048</td>
<td>4096</td>
<td>8192</td>
<td>16384</td>
<td>32768</td>
<td>65536</td>
</tr>
<tr><td bgcolor="#BBFFFF"><b>Time</b></td>
<td>0.004s</td>
<td>0.005s</td>
<td>0.006s</td>
<td>0.008s</td>
<td>0.012s</td>
<td>0.019s</td>
<td>0.035s</td>
<td>0.064s</td>
<td>0.128s</td>
<td>0.257s</td>
<td>0.528s</td>
<td>1.093s</td>
<td>2.434s</td>
<td>5.554s</td>
<td>16.035s</td>
<td>1m21.539s</td>
<td>5m53.272s</td>
</tr>
</table>

<p>This behavior is not specific to <em>make</em>. There are several complaints (some legitimate, some not) with <em>make</em> or its Makefiles, such as the fact that: Makefiles are their own sort of language (as opposed to something already known like Perl or Python), automatic dependency handling is not builtin to the program, or it is difficult maintaining a project across multiple directories. As a result, a number of alternatives to <em>make</em> have been created. For example, <a href="http://www.dsmit.com/cons/">CONS</a>, <a href="http://www.scons.org/">SCONS</a>, <a href="http://www.perforce.com/jam/jam.html">JAM</a>, <a href="http://www.a-a-p.org/">A-A-P</a>, and <a href="http://omake.metaprl.org/index.html">OMake</a>, among others. All of these programs suffer from the same linear update time. Ideally, the time to process an update would be proportional to the amount of changes required. The current linear behavior is actually a result of three separate factors:
<ol>
  <li><em>make</em> always reads the entire DAG before rebuilding anything.</li>
  <li><em>make</em> does not know which files have been updated before-hand; instead, it considers each target and checks to see if it is out-of-date with respect to its dependencies.</li>
  <li>The storage of the dependencies in the filesystem (as generated by gcc's <em>-MD</em> option, or older equivalents such as <em>makedepend</em>) makes it necessary that any program must read every dependency file before updating.</li>
</ol>

<p>We will now consider each of these factors in more detail. First, <em>make</em> always reads in the entire DAG before updating anything. Since each edge must be added to the DAG, it is easy to see that even if we ignore any other processing, constructing the DAG is at best a linear operation. As such, any build program that could hope to improve on a linear time algorithm must not rely on the entire DAG. Instead, it should construct only the portion of the graph that it needs based on the changes to the system. This is difficult because of the second and third factors mentioned above.</p>
<p>When performing an update, the <em>make</em> program essentially starts with a single target (such as 'all'), and asks the question <i>Do I need to update this target?</i>. This question can only be answered by checking the timestamps of each of its dependencies, and each of their dependencies, and so on through the DAG. (Other build programs may use MD5 sums or other hashes instead of timestamps, but this is irrelevant to the complexity of the algorithm). Again, this is at best a linear operation. While developing in a project, however, we don't really care if 'all' is updated or not. What we care about is that anything dependent on the immediate changes we have made (such as a .c file we modified) is updated. A better question to ask is: <i>What files need to be updated given that these files have changed?</i> Answering this question assumes we had a list of files that were changed up front. This is not currently provided to the build program.</p>
<p>Even if we had a list of files that were changed up front, any build tool is again limited to a linear-time algorithm because of the way the dependencies are structured in the filesystem. Consider the following minimal example:</p>

<pre>
Makefile
main/
    main.c
    Makefile
lib/
    lib.c
    lib.h
    Makefile
</pre>
<p>Such a program may have the following dependency information:</p>
<table border=1><caption><i>Figure 1: Simple Program Dependencies</i></caption><tr><td><img src="01-initial.png"></td></tr></table>

<p>The dependencies are actually stored in several different places. The two dependencies on the header file are output by gcc (using the <em>-MD</em> or similar option) the first time the program is built. This information can be used on subsequent builds to re-build both main.o and lib.o if the header changes. The edges from the .o to the .c files are generally written in the Makefiles as implicit rules (such as <em>%.o: %.c</em>). These dependencies are also written by gcc in the .d file. The following graph shows the same program along with where the actual edges are found:</p>
<table border=1><caption><i>Figure 2: Dependencies File Locations</i></caption><tr><td><img src="02-location.png"></td></tr></table>

<p>Dependencies that are from the same file are shown in the same color. This will be explained later. For now, consider we are developing the interaction between the main program and the library. This will likely include changing the lib.h file. When this file is changed, both lib.o and main.o must be rebuilt, the archive must be re-created, and ultimately the main executable must be linked. Now let's assume that this is actually a small part of a much larger project:</p>

<table border=1><caption><i>Figure 3: Large Dependency Graph</i></caption><tr><td><img src="03-large.png"></td></tr></table>

<p>Suppose we are still only developing the interaction between the main program and the library. However, now any changes we make to the library must also cause a rebuild of another (or possibly multiple) other binaries. In order to determine which pieces must be rebuilt, the build tool must read in all of the dependency files and find the dependencies on lib.h. Notice how the incident edges to the lib.h node are all separate colors. This indicates that they are all stored in separate files. So if we try to answer the basic question <i>What files must be updated given that lib.h has changed?</i>, the program must necessarily read in every dependency file, since we have no way of knowing which ones might contain an edge to lib.h. This means if we specify dependencies in this manner, no matter what program we use, we will be forced to use at best a linear update algorithm. The consequence of this fact is that any build program that relies on the output of gcc's dependency mechanism can perform no better than a linear-time update.</p>


<li><h2>Goals</h2></li>
<p>The main goal of this project is to create a build tool that can perform an update in sub-linear time. Specifically, it should be possible to create a build tool that performs in O(d), where <em>d</em> is the number of objects dependent on the set of updated objects. Effectively this is constant time for a specific input, since the complexity is invariant with respect to independent objects. Based on the analysis in section 2, such a project must involve several components:</p>
<ol>
  <li>A program capable of constructing a partial DAG from a list of updated input files. This program will be called the Updater.</li>
  <li>A means of obtaining the list of updated files so the Updater can treat it as <i>a priori</i> knowledge. This will be called the Monitor.</li>
  <li>A means of storing dependencies so that only the edges required by the Updater need to be read.</li>
  <li>A means of generating the dependencies in part 3), given that those output from gcc are insufficient.</li>
</ol>

<p>Each of these parts will now be considered in more detail. However, note that since none of these pieces have actually been created at this point, it is possible that important details may be left out, or alternative approaches may be discovered.</p>
<p>First, consider the Updater program. This program is capable of creating a partial DAG, given a list of updated input files. Essentially what we'd like to do is build all parts of the DAG that link to one of the input files, and ignore the rest. In this way we'll have a complete DAG for the parts we care about, but don't have to process the entire graph. Using the initial example from before, suppose the program is provided with the fact that the file main.c has changed. In this case, the following graph would be constructed:</p>
<table border=1><caption><i>Figure 4: Partial DAG</i></caption><tr><td><img src="04-new_initial.png"></td></tr></table>
<p>They greyed-out nodes and edges are shown for completeness, and would not actually be included in the graph. The program could then follow the graph in a manner that is even simpler than that implemented in <em>make</em>. Here we know that every node in the graph must be updated, so a simple search algorithm to iterate over each node is sufficient. There is no need to check timestamps or MD5 sums in this graph search. Also note that the direction of the edges has changed. This will be explained in more detail in the third component. The Updater program will likely be independent of the platform and target language, since all it will need to do is read dependency files, construct a partial DAG, and execute commands.</p>

<p>The Monitor component must be able to determine which files have changed since the last update in order to pass the list of changes to the Updater. A simple Monitor would involve the user specifying which files have changed. Of course, this would be fairly tedious. More advanced options would likely be platform specific. On Linux, the <em>inotify</em> interface could be used to watch the entire source tree. A daemon program would then be notified when a file is changed (by the user's editor, for example) and could keep a list of these files to be passed to the Updater when necessary. The <a href="http://www.gnome.org/~veillard/gamin/">Gamin</a> program (part of GNOME) is an example of a program that uses the inotify interface to provide file monitoring capabilities.</p>

<p>The third component is not a program. Rather, it describes how dependencies must be stored to allow efficient processing by the Updater. In order to avoid linear searching to find incident edges on a changed file, the edges can simply be reversed. We can call these edges "reverse dependencies", since they effectively provide an answer to the question <i>What files are dependent on me?</i> Instead of creating a dependency (.d) file corresponding to each object (.o) file, a reverse dependency file can instead be created for each input file. For example, when building main.o, a file main.cd and a file lib.hd would be created, or updated. These files would contain the string "main.o". Similarly, the files main.od and lib.ad would be created when linking the main executable, and would contain the string "main". Essentially this means that when the Updater gets the input "main.c", it can read the file main.cd to see that main.o needs to be added to the DAG. Then it would read main.od and see that main needs to be added. The Updater could completely ignore the lib.hd, lib.cd, lib.od, and lib.ad files in this case. The following diagram shows the locations of the reverse dependencies:</p>
<table border=1><caption><i>Figure 5: Reverse Dependency File Locations</i></caption><tr><td><img src="05-reverse_deps.png"></td></tr></table>

<p>The fourth component must provide a way to generate the *d files described above, since gcc will not do this automatically. There are a number of possibilities here. Perhaps the simplest is to create a post-processing program that will run after gcc. It could read in the dependency file generated by gcc, create the new reverse dependency files, and delete the old .d files. Another option is to patch gcc and add a new flag that will create only the reverse dependency files. A third option is to use platform-specific tricks to intercept the open() calls by a program. Based on the flags of the open() call (such as whether read or write-only specified), a proper set of reverse dependency files could be generated. In other words, a file would be created for each read-only file opened(), and in it the contents of the write-only files would be added. This assumes that gcc opens the .c and .h files in read-only mode, and opens the .o file as write-only. Though platform specific, the advantage of this method would be that it would not be specific to gcc, but could apply to arbitrary programs (assuming they follow a similar open() convention). Intercepting the open() calls of the program is possible by creating a shared library and using LD_PRELOAD when launching the program.</p>

<p>Some questions are not entirely answered here, and likely won't be until the programs are implemented. For example, while creating a new file should be fairly easy to handle, what happens when deleting or moving a file? This event should be caught by the Monitor, but the Updater will need to be able to rebuild the necessary pieces, and delete reverse-dependency files that are no longer applicable. I believe this should be possible given the design above. It should be noted that current build systems generally do not properly handle these cases. Many systems expect the user to run a "clean" target of some kind to erase old files, or just leave the files around (potentially misleading the developer, or worse, causing other programs to build incorrectly).</p>

<p>In addition to the programs described above, the project will involve designing sample builds to test the system, user documentation, and performance evaluation. The performance metric will be the time to update. The Updater program can be compared against existing build systems such as <em>make</em> or <em>ant</em> using the same sample builds. If time permits, existing software may be used to provide real-world metrics.</p>

<p>With the Updater and Monitor programs, as well as the necessary changes to create the reverse dependency files, it should be possible for a developer to focus on a small part of a large project and have the build time only be affected by the complexity of the piece being edited. This should allow for a single build system to control a much larger project that is typically handled by today's Makefiles or their substitutes. The ultimate goal for such a capability is described in the next section.</p>

<li><h2>Long-term Goals</h2></li>
<p>As a user of <a href="http://www.gentoo.org/">Gentoo Linux</a>, I periodically download updated packages and install them. Since Gentoo is a source-based distribution, the packages that are downloaded must be built before being installed. For example, recently I upgraded to firefox 2.0.0.9. This package is about 35MB gzipped, and contains roughly 1500 *.c files and 4600 *.cpp files. This is a fairly large program, and consequently takes a long time to build. When Gentoo finished building the program, it installed the resulting binaries into usable system directories, and then discarded the build tree. This seems rather foolish, especially since a week or two ago I upgraded to firefox 2.0.0.8, which also involved a 35MB gzipped source tree, with about the same number of files. Those build results were also discarded. This seems even more foolish when looking at the differences between 2.0.0.8 and 2.0.0.9 - a total of 6 .cpp files and 3 .h files. So Gentoo ended up rebuilding the entire firefox source tree in order to accommodate 9 patches (totaling only a few hundred lines).</p>
<p>For a long-term project (not to be completed during the semester, but to be kept under considering while working on the build system), I would like to investigate updating Gentoo to make use of the Updater program. Updating the system could then involve downloading a set of patches, applying them to an already-unpacked and built source tree, and running the Updater. This would avoid the redundant compilations, at the cost of extra disk space.</p>
</body>
</html>
